class: center, middle, inverse, title-slide # Databases ### Kirill & Nicolas ### cynkra GmbH ### March 15, 2022 --- <style type="text/css"> .pull-left { margin-top: -25px; } .pull-right { margin-top: -25px; } .remark-code { font-size: 14px; } pre { white-space: pre-wrap; /* css-3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* Internet Explorer 5.5+ */ } .font17 { font-size: 17px; } .font14 { font-size: 14px; } </style> # Introduction Organization of half-day R courses: - Intro courses: * Tidyverse intro I * Base R intro/Tidyverse intro II * Data visualization I * Data visualization II - Advanced courses: * Advanced tidyverse * R package creation * Working with database systems * Parallelization & efficient R programming * Databases (this course) --- # Course material Our course material currently is available from GitHub at https://github.com/cynkra/bag-courses Today we will be looking at the folder `5_databases` --- # Download course material  --- # General remarks - We hope for these courses to be interactive: go ahead and ask if something is unclear! - You can also write into the chat, which I will try to monitor when Kirill is presenting. - We were asked to provide recordings of the courses for those of you who cannot join, so recording is activated. - Per course unit, we offer 4 hours of follow up time; approach us with questions (nicolas@cynkra.com)! --- # Goals for today Playing the whole game! - Extract - Transform - Load - **Consume** --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - **Read whole tables** - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/11.webp") background-size: contain background-position: 100% 100% # Read whole tables .pull-left[ - {DBI} and {duckdb} package - Connect - Discover - Read - Query Script: `databases_11.R` ```r library(tidyverse) library(DBI) ``` ] --- background-image: url("data:image/png;base64,#images/11-frame.webp") background-size: contain background-position: 100% 100% # Read whole tables .pull-left[ - {DBI} package - Connect - Discover - Read - Query Script: `databases_11.R` ```r library(tidyverse) library(DBI) ``` ] --- # Connect to the database First step when accessing the database. ```r con_duckdb <- dbConnect(duckdb::duckdb()) con_duckdb ``` ``` <duckdb_connection 0faa0 driver=<duckdb_driver 22170 dbdir=':memory:' read_only=FALSE>> ``` --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Prepare database Normally done in a preparatory step. ```r dm::copy_dm_to( con_duckdb, dm::dm_pixarfilms(), set_key_constraints = FALSE, temporary = FALSE ) ``` --- # Discover tables Where is my data? ```r dbListTables(con_duckdb) ``` ``` [1] "academy" "box_office" "genres" [4] "pixar_films" "pixar_people" "public_response" ``` ```r dbListFields(con_duckdb, "pixar_films") ``` ``` [1] "number" "film" "release_date" "run_time" [5] "film_rating" ``` Caveat: schemas, catalogs, ... ```sql SELECT * FROM INFORMATION_SCHEMA.TABLES ``` --- # Read tables Read entire tables into your local session, if you can afford it. ```r df_pixar_films <- dbReadTable(con_duckdb, "pixar_films") df_pixar_films ``` ``` number film release_date run_time film_rating 1 1 Toy Story 1995-11-22 81 G 2 2 A Bug's Life 1998-11-25 95 G 3 3 Toy Story 2 1999-11-24 92 G 4 4 Monsters, Inc. 2001-11-02 92 G 5 5 Finding Nemo 2003-05-30 100 G 6 6 The Incredibles 2004-11-05 115 PG 7 7 Cars 2006-06-09 117 G 8 8 Ratatouille 2007-06-29 111 G 9 9 WALL-E 2008-06-27 98 G 10 10 Up 2009-05-29 96 PG 11 11 Toy Story 3 2010-06-18 103 G 12 12 Cars 2 2011-06-24 106 G 13 13 Brave 2012-06-22 93 PG 14 14 Monsters University 2013-06-21 104 G 15 15 Inside Out 2015-06-19 95 PG 16 16 The Good Dinosaur 2015-11-25 93 PG 17 17 Finding Dory 2016-06-17 97 PG 18 18 Cars 3 2017-06-16 102 G 19 19 Coco 2017-11-22 105 PG 20 20 Incredibles 2 2018-06-15 118 PG [ reached 'max' / getOption("max.print") -- omitted 7 rows ] ``` --- # Read tables Use `as_tibble()` to convert to a tibble for better display and more robust operation. ```r df_pixar_films <- dbReadTable(con_duckdb, "pixar_films") as_tibble(df_pixar_films) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> --- # Execute queries Write SQL code to define what data you want to see. ```r dbGetQuery(con_duckdb, "SELECT * FROM pixar_films") ``` ``` number film release_date run_time film_rating 1 1 Toy Story 1995-11-22 81 G 2 2 A Bug's Life 1998-11-25 95 G 3 3 Toy Story 2 1999-11-24 92 G 4 4 Monsters, Inc. 2001-11-02 92 G 5 5 Finding Nemo 2003-05-30 100 G 6 6 The Incredibles 2004-11-05 115 PG 7 7 Cars 2006-06-09 117 G 8 8 Ratatouille 2007-06-29 111 G 9 9 WALL-E 2008-06-27 98 G 10 10 Up 2009-05-29 96 PG 11 11 Toy Story 3 2010-06-18 103 G 12 12 Cars 2 2011-06-24 106 G 13 13 Brave 2012-06-22 93 PG 14 14 Monsters University 2013-06-21 104 G 15 15 Inside Out 2015-06-19 95 PG 16 16 The Good Dinosaur 2015-11-25 93 PG 17 17 Finding Dory 2016-06-17 97 PG 18 18 Cars 3 2017-06-16 102 G 19 19 Coco 2017-11-22 105 PG 20 20 Incredibles 2 2018-06-15 118 PG [ reached 'max' / getOption("max.print") -- omitted 7 rows ] ``` --- # Execute queries Write complex SQL code to define what data you want to see. ```r sql <- " SELECT * FROM pixar_films WHERE release_date >= '2020-01-01' " ``` ```r dbGetQuery(con_duckdb, sql) ``` ``` number film release_date run_time film_rating 1 22 Onward 2020-03-06 102 PG 2 23 Soul 2020-12-25 100 PG 3 24 Luca 2021-06-18 151 N/A 4 25 Turning Red 2022-03-11 NA N/A 5 26 Lightyear 2022-06-17 NA N/A 6 27 <NA> 2023-06-16 155 Not Rated ``` --- # Execute queries R 4.0 or later: use new-style string literals for mixing quotes. ```r sql <- r"( SELECT * FROM "pixar_films" WHERE "release_date" >= '2020-01-01' )" ``` ```r dbGetQuery(con_duckdb, sql) ``` ``` number film release_date run_time film_rating 1 22 Onward 2020-03-06 102 PG 2 23 Soul 2020-12-25 100 PG 3 24 Luca 2021-06-18 151 N/A 4 25 Turning Red 2022-03-11 NA N/A 5 26 Lightyear 2022-06-17 NA N/A 6 27 <NA> 2023-06-16 155 Not Rated ``` --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Further pointers .pull-left[ ## Quoting ```r dbQuoteIdentifier(con_duckdb, "academy") ``` ``` <SQL> "academy" ``` ```r dbQuoteLiteral(con_duckdb, "Toy Story") ``` ``` <SQL> 'Toy Story' ``` ```r dbQuoteLiteral(con_duckdb, as.Date("2020-01-01")) ``` ``` <SQL> '2020-01-01'::date ``` ```r glue::glue_sql(...) ``` ] .pull-right[ ## Parameterized queries ```r sql <- " SELECT count(*) FROM pixar_films WHERE release_date >= ? " dbGetQuery(con_duckdb, sql, params = list(as.Date("2020-01-01")) ) ``` ``` count_star() 1 6 ``` ] --- background-image: url("data:image/png;base64,#images/11-frame.webp") background-size: contain background-position: 100% 100% # Read whole tables: Exercises .pull-left[ 1. List all columns from the `box_office` table. 2. Read the `academy` table. 3. Read all records from the `academy` table that correspond to awards won - Hint: Use the query `"SELECT * FROM academy WHERE status = 'Won'"` 4. Use quoting and/or query parameters to stabilize the previous query. ] --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - **Let the database do the heavy lifting 1/2** - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/12.webp") background-size: 40% background-position: 100% 100% # Downsizing on the database .pull-left[ - {dbplyr} package (part of the tidyverse) - Lazy tables - `collect()` - `select()` - `filter()` - `count()` Script: `databases_12_1.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/12-frame.webp") background-size: 40% background-position: 100% 100% # Downsizing on the database .pull-left[ - {dbplyr} package (part of the tidyverse) - Lazy tables - `collect()` - `select()` - `filter()` - `group_by()`, `summarize()`, `ungroup()`, `count()` Script: `databases_12_1.R` ```r library(tidyverse) ``` ] --- # Lazy tables A pointer to a SQL table. The data is still on the database! ```r pixar_films <- tbl(con_duckdb, "pixar_films") pixar_films ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<pixar_films> [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with more rows</span> </CODE></PRE> --- # Read the whole table ```r df_pixar_films <- pixar_films %>% collect() df_pixar_films ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> --- # Select columns With `select()`, like with data frames. .pull-left[ ```r pixar_films %>% select(1:3) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 3]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% select(1:3) %>% show_query() ``` ``` <SQL> SELECT "number", "film", "release_date" FROM "pixar_films" ``` ] --- # Select columns and read .pull-left[ ```r df_pixar_films_3 <- pixar_films %>% select(1:3) %>% collect() df_pixar_films_3 ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 3</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] .pull-right[ ## Data on the database not affected ```r pixar_films %>% collect() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] --- # Select rows With `filter()`, like with data frames. .pull-left[ ```r pixar_films %>% filter(release_date >= "2020-01-01") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> 22 Onward 2020-03-06 102 PG <span style='color: #BCBCBC;'>2</span> 23 Soul 2020-12-25 100 PG <span style='color: #BCBCBC;'>3</span> 24 Luca 2021-06-18 151 N/A <span style='color: #BCBCBC;'>4</span> 25 Turning Red 2022-03-11 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'>5</span> 26 Lightyear 2022-06-17 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'>6</span> 27 <span style='color: #BB0000;'>NA</span> 2023-06-16 155 Not Rated </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% filter(release_date >= "2020-01-01") %>% show_query() ``` ``` <SQL> SELECT * FROM "pixar_films" WHERE ("release_date" >= '2020-01-01') ``` ] --- # Select rows and read .pull-left[ ```r df_pixar_films_202x <- pixar_films %>% filter(release_date >= "2020-01-01") %>% collect() df_pixar_films_202x ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 6 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> 22 Onward 2020-03-06 102 PG <span style='color: #BCBCBC;'>2</span> 23 Soul 2020-12-25 100 PG <span style='color: #BCBCBC;'>3</span> 24 Luca 2021-06-18 151 N/A <span style='color: #BCBCBC;'>4</span> 25 Turning Red 2022-03-11 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'>5</span> 26 Lightyear 2022-06-17 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'>6</span> 27 <span style='color: #BB0000;'>NA</span> 2023-06-16 155 Not Rated </CODE></PRE> ] .pull-right[ ## Data on the database not affected ```r pixar_films %>% collect() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] --- # Aggregate With `group_by()` + `summarize()` + `ungroup()`, like with data frames. .pull-left[ ```r pixar_films %>% group_by(film_rating) %>% summarize(n = n()) %>% ungroup() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> G 13 <span style='color: #BCBCBC;'>2</span> PG 10 <span style='color: #BCBCBC;'>3</span> N/A 3 <span style='color: #BCBCBC;'>4</span> Not Rated 1 </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% group_by(film_rating) %>% summarize(n = n()) %>% ungroup() %>% show_query() ``` ``` <SQL> SELECT "film_rating", COUNT(*) AS "n" FROM "pixar_films" GROUP BY "film_rating" ``` ] --- # Aggregate With `count()`, like with data frames. .pull-left[ ```r pixar_films %>% # # count(film_rating) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> G 13 <span style='color: #BCBCBC;'>2</span> PG 10 <span style='color: #BCBCBC;'>3</span> N/A 3 <span style='color: #BCBCBC;'>4</span> Not Rated 1 </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% # # count(film_rating) %>% show_query() ``` ``` <SQL> SELECT "film_rating", COUNT(*) AS "n" FROM "pixar_films" GROUP BY "film_rating" ``` ] --- # Aggregate and read .pull-left[ ```r df_pixar_films_by_rating <- pixar_films %>% count(film_rating) %>% collect() df_pixar_films_by_rating ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 4 × 2</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> G 13 <span style='color: #BCBCBC;'>2</span> PG 10 <span style='color: #BCBCBC;'>3</span> N/A 3 <span style='color: #BCBCBC;'>4</span> Not Rated 1 </CODE></PRE> ] .pull-right[ ## Data on the database not affected ```r pixar_films %>% collect() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] --- background-image: url("data:image/png;base64,#images/12-frame.webp") background-size: 40% background-position: 100% 100% # Downsizing on the database: Exercises 1 * Find several ways to select the 3 first columns * What happens if you include the name of a variable multiple times in a `select()` call? * Select all columns that contain underscores (use `contains()`) * Use `all_of()` to select 2 columns of your choice --- background-image: url("data:image/png;base64,#images/12-frame.webp") background-size: 40% background-position: 100% 100% # Downsizing on the database: Exercises 2 .pull-left[ Find all films that 1. Are rated "PG" 2. Had a run time below 95 3. Had a rating of "N/A" or "Not Rated" 4. Were released after and including year 2020 5. Have a missing name (`film` column) or `run_time` 6. Are a first sequel (the name ends with "2") - Hint: Bring the data into the R session before filtering ] --- background-image: url("data:image/png;base64,#images/12-frame.webp") background-size: 40% background-position: 100% 100% # Downsizing on the database: Exercises 3 1. How many films are stored in the table? 2. How many films released after 2005 are stored in the table? 3. What is the total run time of all films? - Hint: Use `summarize(sum(...))`, watch out for the warning 4. What is the total run time of all films, per rating? - Hint: Use `group_by()` --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - **Let the database do the heavy lifting 2/2** - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/12_2.webp") background-size: 40% background-position: 100% 100% # Computing on the database .pull-left[ - {dbplyr} package (part of the tidyverse) - `mutate()` - `arrange()` - Grouped `summarize()` / `mutate()` Script: `databases_12_2.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/12_2-frame.webp") background-size: 40% background-position: 100% 100% # Computing on the database .pull-left[ - {dbplyr} package (part of the tidyverse) - `mutate()` - `arrange()` - Grouped `summarize()` / `mutate()` Script: `databases_12_2.R` ```r library(tidyverse) ``` ] --- # Transform With `mutate()`, like with data frames. .pull-left[ ```r pixar_films %>% select(film, release_date) %>% mutate(release_year = year(release_date)) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 3]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>release_year</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> Toy Story 1995-11-22 <span style='text-decoration: underline;'>1</span>995 <span style='color: #BCBCBC;'> 2</span> A Bug's Life 1998-11-25 <span style='text-decoration: underline;'>1</span>998 <span style='color: #BCBCBC;'> 3</span> Toy Story 2 1999-11-24 <span style='text-decoration: underline;'>1</span>999 <span style='color: #BCBCBC;'> 4</span> Monsters, Inc. 2001-11-02 <span style='text-decoration: underline;'>2</span>001 <span style='color: #BCBCBC;'> 5</span> Finding Nemo 2003-05-30 <span style='text-decoration: underline;'>2</span>003 <span style='color: #BCBCBC;'> 6</span> The Incredibles 2004-11-05 <span style='text-decoration: underline;'>2</span>004 <span style='color: #BCBCBC;'> 7</span> Cars 2006-06-09 <span style='text-decoration: underline;'>2</span>006 <span style='color: #BCBCBC;'> 8</span> Ratatouille 2007-06-29 <span style='text-decoration: underline;'>2</span>007 <span style='color: #BCBCBC;'> 9</span> WALL-E 2008-06-27 <span style='text-decoration: underline;'>2</span>008 <span style='color: #BCBCBC;'>10</span> Up 2009-05-29 <span style='text-decoration: underline;'>2</span>009 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% select(film, release_date) %>% mutate(release_year = year(release_date)) %>% show_query() ``` ``` <SQL> SELECT "film", "release_date", EXTRACT(year FROM "release_date") AS "release_year" FROM "pixar_films" ``` ] --- # Transform and read .pull-left[ ```r df_pixar_films_with_release_year <- pixar_films %>% select(film, release_date) %>% mutate(release_year = year(release_date)) %>% collect() df_pixar_films_with_release_year ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 3</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>release_year</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> Toy Story 1995-11-22 <span style='text-decoration: underline;'>1</span>995 <span style='color: #BCBCBC;'> 2</span> A Bug's Life 1998-11-25 <span style='text-decoration: underline;'>1</span>998 <span style='color: #BCBCBC;'> 3</span> Toy Story 2 1999-11-24 <span style='text-decoration: underline;'>1</span>999 <span style='color: #BCBCBC;'> 4</span> Monsters, Inc. 2001-11-02 <span style='text-decoration: underline;'>2</span>001 <span style='color: #BCBCBC;'> 5</span> Finding Nemo 2003-05-30 <span style='text-decoration: underline;'>2</span>003 <span style='color: #BCBCBC;'> 6</span> The Incredibles 2004-11-05 <span style='text-decoration: underline;'>2</span>004 <span style='color: #BCBCBC;'> 7</span> Cars 2006-06-09 <span style='text-decoration: underline;'>2</span>006 <span style='color: #BCBCBC;'> 8</span> Ratatouille 2007-06-29 <span style='text-decoration: underline;'>2</span>007 <span style='color: #BCBCBC;'> 9</span> WALL-E 2008-06-27 <span style='text-decoration: underline;'>2</span>008 <span style='color: #BCBCBC;'>10</span> Up 2009-05-29 <span style='text-decoration: underline;'>2</span>009 <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] .pull-right[ ## Data on the database not affected ```r pixar_films %>% collect() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] --- # Aggregate (complex) With `group_by()` + `summarize()` + `ungroup()`, like with data frames. .pull-left[ ```r pixar_films %>% group_by(film_rating) %>% summarize(mean_run_time = mean(run_time)) %>% ungroup() ``` ``` Warning: Missing values are always removed in SQL. Use `mean(x, na.rm = TRUE)` to silence this warning This warning is displayed only once per session. ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>mean_run_time</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> G 100. <span style='color: #BCBCBC;'>2</span> PG 101. <span style='color: #BCBCBC;'>3</span> N/A 151 <span style='color: #BCBCBC;'>4</span> Not Rated 155 </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% group_by(film_rating) %>% summarize(mean_run_time = mean(run_time)) %>% ungroup() %>% show_query() ``` ``` <SQL> SELECT "film_rating", AVG("run_time") AS "mean_run_time" FROM "pixar_films" GROUP BY "film_rating" ``` ] --- # Transform and aggregate With `count()`, like with data frames. .pull-left[ ```r pixar_films %>% count(release_year = year(release_date)) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>release_year</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> <span style='text-decoration: underline;'>1</span>995 1 <span style='color: #BCBCBC;'> 2</span> <span style='text-decoration: underline;'>1</span>998 1 <span style='color: #BCBCBC;'> 3</span> <span style='text-decoration: underline;'>1</span>999 1 <span style='color: #BCBCBC;'> 4</span> <span style='text-decoration: underline;'>2</span>001 1 <span style='color: #BCBCBC;'> 5</span> <span style='text-decoration: underline;'>2</span>003 1 <span style='color: #BCBCBC;'> 6</span> <span style='text-decoration: underline;'>2</span>004 1 <span style='color: #BCBCBC;'> 7</span> <span style='text-decoration: underline;'>2</span>006 1 <span style='color: #BCBCBC;'> 8</span> <span style='text-decoration: underline;'>2</span>007 1 <span style='color: #BCBCBC;'> 9</span> <span style='text-decoration: underline;'>2</span>008 1 <span style='color: #BCBCBC;'>10</span> <span style='text-decoration: underline;'>2</span>009 1 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% mutate(release_year = year(release_date)) %>% count(release_year) %>% show_query() ``` ``` <SQL> SELECT "release_year", COUNT(*) AS "n" FROM (SELECT "number", "film", "release_date", "run_time", "film_rating", EXTRACT(year FROM "release_date") AS "release_year" FROM "pixar_films") "q01" GROUP BY "release_year" ``` ] --- # In-place aggregate (window functions) With `add_count()`, like with data frames. .pull-left[ ```r pixar_films %>% add_count(release_year = year(release_date)) %>% filter(n > 1) %>% select(film, release_year, n) %>% arrange(release_date) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 3]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Groups: release_year</span> <span style='color: #949494;'># Ordered by: release_date</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_year</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> Inside Out <span style='text-decoration: underline;'>2</span>015 2 <span style='color: #BCBCBC;'>2</span> The Good Dinosaur <span style='text-decoration: underline;'>2</span>015 2 <span style='color: #BCBCBC;'>3</span> Cars 3 <span style='text-decoration: underline;'>2</span>017 2 <span style='color: #BCBCBC;'>4</span> Coco <span style='text-decoration: underline;'>2</span>017 2 <span style='color: #BCBCBC;'>5</span> Onward <span style='text-decoration: underline;'>2</span>020 2 <span style='color: #BCBCBC;'>6</span> Soul <span style='text-decoration: underline;'>2</span>020 2 <span style='color: #BCBCBC;'>7</span> Turning Red <span style='text-decoration: underline;'>2</span>022 2 <span style='color: #BCBCBC;'>8</span> Lightyear <span style='text-decoration: underline;'>2</span>022 2 </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% add_count(release_year = year(release_date)) %>% filter(n > 1) %>% arrange(release_date) %>% select(film, release_year, n) %>% show_query() ``` ``` <SQL> SELECT "film", "release_year", "n" FROM (SELECT "number", "film", "release_date", "run_time", "film_rating", "release_year", COUNT(*) OVER (PARTITION BY "release_year") AS "n" FROM (SELECT "number", "film", "release_date", "run_time", "film_rating", EXTRACT(year FROM "release_date") AS "release_year" FROM "pixar_films") "q01") "q02" WHERE ("n" > 1.0) ORDER BY "release_date" ``` ] --- background-image: url("data:image/png;base64,#images/12_2-frame.webp") background-size: 40% background-position: 100% 100% # Computing on the database: Exercises .pull-left[ 1. Add new columns `release_year` and `release_month`. 2. Use the new columns to compute the number of months since January 1970 for each film 3. Compute the overall median run time, and the median run time per film rating 4. For each film except the last, compute how many days have passed until the next film. - Hint: Use `lag(..., order_by = ...)` ] .pull-right[ 5. Find the maximum number of days between releases of two G and two PG films. ] --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - **Basic ETL for one table** - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/13.webp") background-size: 40% background-position: 100% 100% # Extract, Transform, Load .pull-left[ - Obtain raw data - Prepare database table - Write it to a new database - Consume Script: `databases_13.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/13-frame.webp") background-size: 40% background-position: 100% 100% # Extract, Transform, Load .pull-left[ - Obtain raw data - Prepare database table - Write it to a new database - Consume Script: `databases_13.R` ```r library(tidyverse) ``` ] --- # Extract: Raw data From arbitrary source: CSV, Excel, web API, other database, ... ```r pixar_films_raw <- pixarfilms::pixar_films pixar_films_raw ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Transform: Derived data - Fix type of `number` column - Extract `franchise` and `sequel` columns ```r pixar_films_clean ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 7</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>franchise</span> <span style='font-weight: bold;'>sequel</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story Toy Story 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life A Bug's Life <span style='color: #BB0000;'>NA</span> 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. Monsters, Inc. <span style='color: #BB0000;'>NA</span> 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo Finding Nemo <span style='color: #BB0000;'>NA</span> 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles The Incredibles <span style='color: #BB0000;'>NA</span> 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars Cars 1 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille Ratatouille <span style='color: #BB0000;'>NA</span> 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E WALL-E <span style='color: #BB0000;'>NA</span> 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up Up <span style='color: #BB0000;'>NA</span> 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> --- # Create target database Depending on workflow, often an existing database is used. ```r db_path <- fs::path_abs("pixar.duckdb") db_path ``` ``` /Users/kirill/git/cynkra/bag/bag-courses/5_databases/pixar.duckdb ``` ```r fs::file_delete(db_path) con <- dbConnect(duckdb::duckdb(dbdir = db_path)) con ``` ``` <duckdb_connection b1490 driver=<duckdb_driver b5870 dbdir='/Users/kirill/git/cynkra/bag/bag-courses/5_databases/pixar.duckdb' read_only=FALSE>> ``` --- # Load: Write table to the database ```r dbWriteTable(con, "pixar_films", pixar_films_clean) dbExecute(con, "CREATE UNIQUE INDEX pixarfilms_pk ON pixar_films (film)") ``` ``` [1] 0 ``` ```r nrow(dbReadTable(con, "pixar_films")) ``` ``` [1] 27 ``` ```r dbDisconnect(con) ``` --- # Consume: share the file, open it .pull-left[ ```r fs::dir_info() %>% arrange(desc(birth_time)) %>% head(2) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 2 × 18</span> <span style='font-weight: bold;'>path</span> <span style='font-weight: bold;'>type</span> <span style='font-weight: bold;'>size</span> <span style='font-weight: bold;'>permissi…</span> <span style='font-weight: bold;'>modification_time</span> <span style='font-weight: bold;'>user</span> <span style='font-weight: bold;'>group</span> <span style='color: #949494; font-style: italic;'><fs::path></span> <span style='color: #949494; font-style: italic;'><fct></span> <span style='color: #949494; font-style: italic;'><fs::></span> <span style='color: #949494; font-style: italic;'><fs::per></span> <span style='color: #949494; font-style: italic;'><dttm></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> …uckdb.wal file 1.84K rw-r--r-- 2022-03-14 <span style='color: #949494;'>21:16:33</span> kiri… staff <span style='color: #BCBCBC;'>2</span> …ar.duckdb file 12K rw-r--r-- 2022-03-14 <span style='color: #949494;'>21:16:33</span> kiri… staff <span style='color: #949494;'># … with 11 more variables: </span><span style='color: #949494; font-weight: bold;'>device_id</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>hard_links</span><span style='color: #949494;'> <dbl>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>special_device_id</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>inode</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>block_size</span><span style='color: #949494;'> <dbl>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>blocks</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>flags</span><span style='color: #949494;'> <int>, </span><span style='color: #949494; font-weight: bold;'>generation</span><span style='color: #949494;'> <dbl>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>access_time</span><span style='color: #949494;'> <dttm>, </span><span style='color: #949494; font-weight: bold;'>change_time</span><span style='color: #949494;'> <dttm>, </span><span style='color: #949494; font-weight: bold;'>birth_time</span><span style='color: #949494;'> <dttm></span> </CODE></PRE> ] .pull-right[ ```r con <- dbConnect(duckdb::duckdb(dbdir = db_path, read_only = TRUE)) my_pixar_films <- tbl(con, "pixar_films") my_pixar_films ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<pixar_films> [?? x 7]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>franchise</span> <span style='font-weight: bold;'>sequel</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story Toy Story 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life A Bug's … <span style='color: #BB0000;'>NA</span> 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. Monsters… <span style='color: #BB0000;'>NA</span> 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo Finding … <span style='color: #BB0000;'>NA</span> 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles The Incr… <span style='color: #BB0000;'>NA</span> 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars Cars 1 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille Ratatoui… <span style='color: #BB0000;'>NA</span> 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E WALL-E <span style='color: #BB0000;'>NA</span> 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up Up <span style='color: #BB0000;'>NA</span> 2009-05-29 96 PG <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- background-image: url("data:image/png;base64,#images/13-frame.webp") background-size: 40% background-position: 100% 100% # Exercises 1. Adapt the ETL workflow to convert the `run_time` column to a duration. - Hint: Use `mutate()` with `hms::hms(minutes = ...)` . 2. Re-run the workflow. --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - **Subtle issues to watch out for** ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/14.webp") background-size: 40% background-position: 100% 100% # Caveats .pull-left[ - Lazy tables vs. data frames/tibbles - Order - Logical/Boolean data type - Imperfect translation - Materialization Script: `databases_14.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/14-frame.webp") background-size: 40% background-position: 100% 100% # Caveats .pull-left[ - Differences between database backends - Lazy tables vs. data frames/tibbles - Order - Logical/Boolean data type - Imperfect translation - Materialization Script: `databases_14.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Prepare SQLite database Very similar to DuckDB, subtle differences. ```r con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), extended_types = TRUE) dm::copy_dm_to( con_sqlite, dm::dm_pixarfilms(), set_key_constraints = FALSE, temporary = FALSE ) ``` --- # SQLite table ```r pixar_films_sqlite <- tbl(con_sqlite, "pixar_films") pixar_films_sqlite ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<pixar_films> [?? x 5]</span> <span style='color: #949494;'># Database: sqlite 3.37.2 []</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with more rows</span> </CODE></PRE> --- # Lazy tables are not data frames The bracket operators `[` and `[[` don't work. .pull-left[ ## Bad ```r pixar_films[c("film", "film_rating")] ``` ``` $<NA> NULL $<NA> NULL ``` ] .pull-right[ ## Good ```r pixar_films %>% select(film, film_rating) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story G <span style='color: #BCBCBC;'> 2</span> A Bug's Life G <span style='color: #BCBCBC;'> 3</span> Toy Story 2 G <span style='color: #BCBCBC;'> 4</span> Monsters, Inc. G <span style='color: #BCBCBC;'> 5</span> Finding Nemo G <span style='color: #BCBCBC;'> 6</span> The Incredibles PG <span style='color: #BCBCBC;'> 7</span> Cars G <span style='color: #BCBCBC;'> 8</span> Ratatouille G <span style='color: #BCBCBC;'> 9</span> WALL-E G <span style='color: #BCBCBC;'>10</span> Up PG <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Lazy tables are not data frames Concatenation requires `union_all()`. .pull-left[ ## Bad ```r try( bind_rows(pixar_films, pixar_films) ) ``` <PRE class="fansi fansi-output"><CODE>Error in bind_rows(pixar_films, pixar_films) : Argument 1 must be a data frame or a named atomic vector. </CODE></PRE> ] .pull-right[ ## Good ```r union_all(pixar_films, pixar_films) %>% arrange(release_date) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Ordered by: release_date</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 3</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 4</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 5</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 6</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 7</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 8</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 9</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'>10</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Lazy tables are not data frames The bracket operator `[` doesn't work for row subsetting. .pull-left[ ## Bad ```r try( pixar_films[1:3, ] ) ``` ``` Error in pixar_films[1:3, ] : incorrect number of dimensions ``` ] .pull-right[ ## Still bad ```r df_pixar_films <- pixar_films %>% collect() df_pixar_films[1:3, ] ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 3 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'>2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'>3</span> 3 Toy Story 2 1999-11-24 92 G </CODE></PRE> ] --- # Lazy tables are set-oriented No intrinsic order, `slice()` doesn't work. .pull-left[ ## Bad ```r try( pixar_films %>% slice(1:3) ) ``` ``` Error in slice(., 1:3) : slice() is not supported on database backends ``` ] .pull-right[ ## Better ```r pixar_films %>% filter(between(row_number(), 1, 3)) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'>2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'>3</span> 3 Toy Story 2 1999-11-24 92 G </CODE></PRE> ] --- # Lazy tables are set-oriented Introduce row numbering via `row_number()` (based on a specific order) as a replacement for `slice()`. .pull-left[ ```r pixar_films %>% filter(between(row_number(release_date), 1, 3)) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'>2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'>3</span> 3 Toy Story 2 1999-11-24 92 G </CODE></PRE> ] .pull-right[ ```r pixar_films %>% dbplyr::window_order(release_date) %>% filter(between(row_number(), 1, 3)) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Ordered by: release_date</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'>2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'>3</span> 3 Toy Story 2 1999-11-24 92 G </CODE></PRE> ] --- # Lazy tables are set-oriented Use the `order_by` argument or `dbplyr::window_order()` (but not `arrange()`) to set the order in `cumsum()`, `lag()` etc.. .pull-left[ ## Bad ```r try( pixar_films %>% group_by(film_rating) %>% mutate(total_run_time = cumsum(run_time)) %>% ungroup() ) ``` ``` Warning: Windowed expression 'SUM("run_time")' does not have explicit order. Please use arrange() or window_order() to make determinstic. ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 6]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>total_r…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G 81 <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G 176 <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G 268 <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G 360 <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G 460 <span style='color: #BCBCBC;'> 6</span> 7 Cars 2006-06-09 117 G 577 <span style='color: #BCBCBC;'> 7</span> 8 Ratatouille 2007-06-29 111 G 688 <span style='color: #BCBCBC;'> 8</span> 9 WALL-E 2008-06-27 98 G 786 <span style='color: #BCBCBC;'> 9</span> 11 Toy Story 3 2010-06-18 103 G 889 <span style='color: #BCBCBC;'>10</span> 12 Cars 2 2011-06-24 106 G 995 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Good ```r pixar_films %>% group_by(film_rating) %>% dbplyr::window_order(release_date) %>% mutate(total_run_time = cumsum(run_time)) %>% ungroup() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 6]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Ordered by: release_date</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>total_r…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G 81 <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G 176 <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G 268 <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G 360 <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G 460 <span style='color: #BCBCBC;'> 6</span> 7 Cars 2006-06-09 117 G 577 <span style='color: #BCBCBC;'> 7</span> 8 Ratatouille 2007-06-29 111 G 688 <span style='color: #BCBCBC;'> 8</span> 9 WALL-E 2008-06-27 98 G 786 <span style='color: #BCBCBC;'> 9</span> 11 Toy Story 3 2010-06-18 103 G 889 <span style='color: #BCBCBC;'>10</span> 12 Cars 2 2011-06-24 106 G 995 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Lazy tables are set-oriented Use `arrange()` to fix the order, but only at the end of a query! .pull-left[ ## Bad ```r pixar_films %>% arrange(film) %>% mutate(run_time_hr = run_time / 60) %>% filter(run_time_hr < 2) ``` <PRE class="fansi fansi-warning"><CODE>Warning: ORDER BY is ignored in subqueries without LIMIT <span style='color: #0000BB;'>ℹ</span> Do you need to move arrange() later in the pipeline or use window_order() instead? </CODE></PRE><PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 6]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Ordered by: film</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>run_ti…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G 1.35 <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G 1.58 <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G 1.53 <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G 1.53 <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G 1.67 <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG 1.92 <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G 1.95 <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G 1.85 <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G 1.63 <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG 1.6 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Good ```r pixar_films %>% mutate(run_time_hr = run_time / 60) %>% filter(run_time_hr < 2) %>% arrange(film) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 6]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Ordered by: film</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>run_time…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> 2 A Bug's Life 1998-11-25 95 G 1.58 <span style='color: #BCBCBC;'> 2</span> 13 Brave 2012-06-22 93 PG 1.55 <span style='color: #BCBCBC;'> 3</span> 7 Cars 2006-06-09 117 G 1.95 <span style='color: #BCBCBC;'> 4</span> 12 Cars 2 2011-06-24 106 G 1.77 <span style='color: #BCBCBC;'> 5</span> 18 Cars 3 2017-06-16 102 G 1.7 <span style='color: #BCBCBC;'> 6</span> 19 Coco 2017-11-22 105 PG 1.75 <span style='color: #BCBCBC;'> 7</span> 17 Finding Dory 2016-06-17 97 PG 1.62 <span style='color: #BCBCBC;'> 8</span> 5 Finding Nemo 2003-05-30 100 G 1.67 <span style='color: #BCBCBC;'> 9</span> 20 Incredibles 2 2018-06-15 118 PG 1.97 <span style='color: #BCBCBC;'>10</span> 15 Inside Out 2015-06-19 95 PG 1.58 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Sort order of missing values `NA` sort last in data frames, but first on databases. .pull-left[ ```r pixar_films %>% arrange(run_time) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<pixar_films> [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Ordered by: run_time</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 25 Turning Red 2022-03-11 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'> 2</span> 26 Lightyear 2022-06-17 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'> 3</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 4</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 5</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 6</span> 13 Brave 2012-06-22 93 PG <span style='color: #BCBCBC;'> 7</span> 16 The Good Dinosaur 2015-11-25 93 PG <span style='color: #BCBCBC;'> 8</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 9</span> 15 Inside Out 2015-06-19 95 PG <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ```r pixar_films %>% arrange(is.na(run_time), run_time) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<pixar_films> [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Ordered by: is.na(run_time), run_time</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 3</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 4</span> 13 Brave 2012-06-22 93 PG <span style='color: #BCBCBC;'> 5</span> 16 The Good Dinosaur 2015-11-25 93 PG <span style='color: #BCBCBC;'> 6</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 7</span> 15 Inside Out 2015-06-19 95 PG <span style='color: #BCBCBC;'> 8</span> 10 Up 2009-05-29 96 PG <span style='color: #BCBCBC;'> 9</span> 17 Finding Dory 2016-06-17 97 PG <span style='color: #BCBCBC;'>10</span> 9 WALL-E 2008-06-27 98 G <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Logical/Boolean data type Support differs across databases. .pull-left[ ## DuckDB ```r pixar_films %>% count(film_rating == "PG") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>`film_rating == "PG"`</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><lgl></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> FALSE 17 <span style='color: #BCBCBC;'>2</span> TRUE 10 </CODE></PRE> ] .pull-right[ ## SQLite ```r pixar_films_sqlite %>% count(film_rating == "PG") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: sqlite 3.37.2 []</span> <span style='font-weight: bold;'>`film_rating == "PG"`</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #BCBCBC;'>1</span> 0 17 <span style='color: #BCBCBC;'>2</span> 1 10 </CODE></PRE> ] --- # Logical/Boolean data type Use `if_else()` or `case_when()` if in doubt. .pull-left[ ## SQLite ```r pixar_films_sqlite %>% count(film_rating == "PG") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: sqlite 3.37.2 []</span> <span style='font-weight: bold;'>`film_rating == "PG"`</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #BCBCBC;'>1</span> 0 17 <span style='color: #BCBCBC;'>2</span> 1 10 </CODE></PRE> ] .pull-right[ ## SQL Server (universal) ```r pixar_films_sqlite %>% count(if_else(film_rating == "PG", 1L, 0L)) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: sqlite 3.37.2 []</span> <span style='font-weight: bold;'>`if_else(film_rating == "PG", 1L, 0L)`</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #BCBCBC;'>1</span> 0 17 <span style='color: #BCBCBC;'>2</span> 1 10 </CODE></PRE> ] --- # Imperfect translations Numeric values always use a decimal point. .pull-left[ ## Numeric ```r pixar_films %>% filter(run_time < 120) %>% show_query() ``` ``` <SQL> SELECT * FROM "pixar_films" WHERE ("run_time" < 120.0) ``` ] .pull-right[ ## Integer ```r pixar_films %>% filter(run_time < 120L) %>% show_query() ``` ``` <SQL> SELECT * FROM "pixar_films" WHERE ("run_time" < 120) ``` ] --- # Imperfect translations Not all R functions are supported. .pull-left[ ```r try( pixar_films %>% filter(grepl("^Toy", film)) %>% print() ) ``` ``` Error : duckdb_prepare_R: Failed to prepare query SELECT * FROM "pixar_films" WHERE (grepl('^Toy', "film")) LIMIT 11 Error: Catalog Error: Scalar Function with name grepl does not exist! Did you mean "ceil"? LINE 3: WHERE (grepl('^Toy', "film")) ^ ``` ] .pull-right[ ```r try( pixar_films %>% filter(str_detect(film, "^Toy")) %>% print() ) ``` ``` Error : str_detect() is not available in this SQL variant ``` ] --- # Imperfect translations Unknown functions and operators are passed on verbatim. .pull-left[ ## Lazy table ```r pixar_films %>% filter(film %LIKE% "Toy%") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'>2</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'>3</span> 11 Toy Story 3 2010-06-18 103 G <span style='color: #BCBCBC;'>4</span> 21 Toy Story 4 2019-06-21 100 G </CODE></PRE> ] .pull-right[ ## Data frame ```r try( pixar_films %>% collect() %>% filter(film %LIKE% "Toy%") ) ``` <PRE class="fansi fansi-output"><CODE>Error in filter(., film %LIKE% "Toy%") : Problem while computing `..1 = film %LIKE% "Toy%"`. <span style='font-weight: bold;'>Caused by error in `film %LIKE% "Toy%"`:</span> <span style='color: #BBBB00;'>!</span> could not find function "%LIKE%" </CODE></PRE> ] --- # Imperfect translations Unknown functions and operators are passed on verbatim. .pull-left[ ## Lazy table ```r pixar_films %>% summarize(MAX(run_time)) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 1]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>`MAX(run_time)`</span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> 155 </CODE></PRE> ] .pull-right[ ## Data frame ```r try( pixar_films %>% collect() %>% summarize(MAX(run_time)) ) ``` <PRE class="fansi fansi-output"><CODE>Error in summarize(., MAX(run_time)) : Problem while computing `..1 = MAX(run_time)`. <span style='font-weight: bold;'>Caused by error in `MAX()`:</span> <span style='color: #BBBB00;'>!</span> could not find function "MAX" </CODE></PRE> ] --- # Imperfect translations Aggregation functions don't support `na.rm = FALSE` (the default in R). .pull-left[ ## Warning ```r pixar_films %>% summarize(total_run_time = sum(run_time)) %>% show_query() ``` ``` Warning: Missing values are always removed in SQL. Use `SUM(x, na.rm = TRUE)` to silence this warning This warning is displayed only once per session. ``` ``` <SQL> SELECT SUM("run_time") AS "total_run_time" FROM "pixar_films" ``` ] .pull-right[ ## Hackaround ```r pixar_films %>% summarize(total_run_time = SUM(run_time)) %>% show_query() ``` ``` <SQL> SELECT SUM("run_time") AS "total_run_time" FROM "pixar_films" ``` ] --- # Imperfect translations Escape hatch: `sql()` to pass through arbitrary SQL. .pull-left[ ## Fine-grained ```r pixar_films %>% mutate(number = CAST(number %AS% sql("integer"))) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Everything ```r pixar_films %>% mutate(number = CAST(number %AS% sql("integer"))) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Imperfect translations On the up side: `sql()` is rarely needed. .pull-left[ ## Manual ```r pixar_films %>% transmute(number = CAST(number %AS% sql("integer"))) %>% show_query() ``` ``` <SQL> SELECT CAST("number" AS integer) AS "number" FROM "pixar_films" ``` ] .pull-right[ ## Translated ```r pixar_films %>% transmute(number = as.integer(number)) %>% show_query() ``` ``` <SQL> SELECT CAST("number" AS INTEGER) AS "number" FROM "pixar_films" ``` ] --- # Lazy tables are recomputed on every access Saving to a variable doesn't help. .pull-left[ ```r films_two_per_year <- pixar_films %>% transmute(film, y = year(release_date)) %>% add_count(y) %>% filter(n > 1) films_two_per_year %>% show_query() ``` ``` <SQL> SELECT * FROM (SELECT "film", "y", COUNT(*) OVER (PARTITION BY "y") AS "n" FROM (SELECT "film", EXTRACT(year FROM "release_date") AS "y" FROM "pixar_films") "q01") "q02" WHERE ("n" > 1.0) ``` ] .pull-right[ ```r films_two_per_year %>% arrange(release_date) %>% show_query() ``` ``` <SQL> SELECT * FROM (SELECT "film", "y", COUNT(*) OVER (PARTITION BY "y") AS "n" FROM (SELECT "film", EXTRACT(year FROM "release_date") AS "y" FROM "pixar_films") "q01") "q02" WHERE ("n" > 1.0) ORDER BY "release_date" ``` ] --- # Materialize temporary results Remedy: materialize with `compute()`. Requires write access to the database (at least for temporary tables). .pull-left[ ```r films_two_per_year_mat <- films_two_per_year %>% compute(unique_indexes = list(c("film"))) films_two_per_year_mat ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<dbplyr_001> [?? x 3]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Groups: y</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>y</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> Turning Red <span style='text-decoration: underline;'>2</span>022 2 <span style='color: #BCBCBC;'>2</span> Lightyear <span style='text-decoration: underline;'>2</span>022 2 <span style='color: #BCBCBC;'>3</span> Onward <span style='text-decoration: underline;'>2</span>020 2 <span style='color: #BCBCBC;'>4</span> Soul <span style='text-decoration: underline;'>2</span>020 2 <span style='color: #BCBCBC;'>5</span> Cars 3 <span style='text-decoration: underline;'>2</span>017 2 <span style='color: #BCBCBC;'>6</span> Coco <span style='text-decoration: underline;'>2</span>017 2 <span style='color: #BCBCBC;'>7</span> Inside Out <span style='text-decoration: underline;'>2</span>015 2 <span style='color: #BCBCBC;'>8</span> The Good Dinosaur <span style='text-decoration: underline;'>2</span>015 2 </CODE></PRE> ] .pull-right[ ```r films_two_per_year_mat %>% arrange(release_date) %>% show_query() ``` ``` <SQL> SELECT * FROM "dbplyr_001" ORDER BY "release_date" ``` ] --- background-image: url("data:image/png;base64,#images/14-frame.webp") background-size: 40% background-position: 100% 100% # Exercises 1. Experiment. --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - **Joins** - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/21.webp") background-size: 40% background-position: 100% 100% # Joins .pull-left[ - Usage - Join sources - Mounting Script: `databases_21.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/21-frame.webp") background-size: 40% background-position: 100% 100% # Joins .pull-left[ - Usage - Join sources - Mounting Script: `databases_21.R` ```r library(tidyverse) ``` ] --- # A second table .pull-left[ ```r academy <- tbl(con_duckdb, "academy") academy_sqlite <- tbl(con_sqlite, "academy") academy %>% count(status) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> Award not yet introduced 3 <span style='color: #BCBCBC;'>2</span> Nominated 36 <span style='color: #BCBCBC;'>3</span> Ineligible 23 <span style='color: #BCBCBC;'>4</span> Won Special Achievement 1 <span style='color: #BCBCBC;'>5</span> Won 17 </CODE></PRE> ] .pull-right[ ```r academy ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<academy> [?? x 3]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Feature Award not yet introduced <span style='color: #BCBCBC;'> 2</span> Toy Story Original Screenplay Nominated <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Screenplay Ineligible <span style='color: #BCBCBC;'> 4</span> Toy Story Original Score Nominated <span style='color: #BCBCBC;'> 5</span> Toy Story Original Song Nominated <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won Special Achievement <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Feature Award not yet introduced <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Screenplay Ineligible <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Score Nominated <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Feature Award not yet introduced <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Left join The most frequent kind of join. .pull-left[ ## Unsafe ```r academy %>% left_join(pixar_films) ``` <PRE class="fansi fansi-message"><CODE>Joining, by = "film" </CODE></PRE><PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 7]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Fe… Award… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> Toy Story Original Sc… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Scr… Ineli… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 4</span> Toy Story Original Sc… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 5</span> Toy Story Original So… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won S… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Fe… Award… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Scr… Ineli… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Sc… Nomin… 2 1998-11-25 95 G <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Fe… Award… 3 1999-11-24 92 G <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Better ```r academy %>% left_join(pixar_films, by = "film") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 7]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Fe… Award… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> Toy Story Original Sc… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Scr… Ineli… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 4</span> Toy Story Original Sc… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 5</span> Toy Story Original So… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won S… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Fe… Award… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Scr… Ineli… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Sc… Nomin… 2 1998-11-25 95 G <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Fe… Award… 3 1999-11-24 92 G <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Left join Computed on the database, original data unchanged. ```r academy %>% left_join(pixar_films, by = "film") %>% show_query() ``` ``` <SQL> SELECT "LHS"."film" AS "film", "award_type", "status", "number", "release_date", "run_time", "film_rating" FROM "academy" AS "LHS" LEFT JOIN "pixar_films" AS "RHS" ON ("LHS"."film" = "RHS"."film") ``` --- # Join with preparation The right-hand side in the join should come from a variable. .pull-left[ ## Prepare RHS ```r academy_won <- academy %>% filter(status == "Won") %>% count(film, name = "n_won") academy_won ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>n_won</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> Monsters, Inc. 1 <span style='color: #BCBCBC;'> 2</span> Finding Nemo 1 <span style='color: #BCBCBC;'> 3</span> The Incredibles 2 <span style='color: #BCBCBC;'> 4</span> Ratatouille 1 <span style='color: #BCBCBC;'> 5</span> WALL-E 1 <span style='color: #BCBCBC;'> 6</span> Up 2 <span style='color: #BCBCBC;'> 7</span> Toy Story 3 2 <span style='color: #BCBCBC;'> 8</span> Brave 1 <span style='color: #BCBCBC;'> 9</span> Inside Out 1 <span style='color: #BCBCBC;'>10</span> Coco 2 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ```r pixar_films %>% left_join(academy_won, by = "film") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 6]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n_won</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> 4 Monsters, Inc. 2001-11-02 92 G 1 <span style='color: #BCBCBC;'> 2</span> 5 Finding Nemo 2003-05-30 100 G 1 <span style='color: #BCBCBC;'> 3</span> 6 The Incredibles 2004-11-05 115 PG 2 <span style='color: #BCBCBC;'> 4</span> 8 Ratatouille 2007-06-29 111 G 1 <span style='color: #BCBCBC;'> 5</span> 9 WALL-E 2008-06-27 98 G 1 <span style='color: #BCBCBC;'> 6</span> 10 Up 2009-05-29 96 PG 2 <span style='color: #BCBCBC;'> 7</span> 11 Toy Story 3 2010-06-18 103 G 2 <span style='color: #BCBCBC;'> 8</span> 13 Brave 2012-06-22 93 PG 1 <span style='color: #BCBCBC;'> 9</span> 15 Inside Out 2015-06-19 95 PG 1 <span style='color: #BCBCBC;'>10</span> 19 Coco 2017-11-22 105 PG 2 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Join with postprocessing .pull-left[ ## Raw result ```r pixar_films %>% left_join(academy_won, by = "film") %>% arrange(release_date) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 6]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Ordered by: release_date</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n_won</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BB0000;'>NA</span> <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BB0000;'>NA</span> <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BB0000;'>NA</span> <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G 1 <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G 1 <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG 2 <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BB0000;'>NA</span> <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G 1 <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G 1 <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG 2 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## After postprocessing ```r pixar_films %>% left_join(academy_won, by = "film") %>% mutate(n_won = coalesce(n_won, 0L)) %>% arrange(release_date) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 6]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Ordered by: release_date</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n_won</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G 0 <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G 0 <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G 0 <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G 1 <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G 1 <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG 2 <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G 0 <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G 1 <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G 1 <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG 2 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Join with processing Computed on the database, original data unchanged. ```r pixar_films %>% left_join(academy_won, by = "film") %>% mutate(n_won = coalesce(n_won, 0L)) %>% arrange(release_date) %>% show_query() ``` ``` <SQL> SELECT "number", "film", "release_date", "run_time", "film_rating", COALESCE("n_won", 0) AS "n_won" FROM (SELECT "number", "LHS"."film" AS "film", "release_date", "run_time", "film_rating", "n_won" FROM "pixar_films" AS "LHS" LEFT JOIN (SELECT "film", COUNT(*) AS "n_won" FROM "academy" WHERE ("status" = 'Won') GROUP BY "film") "RHS" ON ("LHS"."film" = "RHS"."film") ) "q01" ORDER BY "release_date" ``` --- # Tables must be on the same source Use `copy = TRUE` to enforce, the result is a lazy table if the LHS is a lazy table. .pull-left[ ## Bad ```r try( academy %>% left_join(pixar_films_sqlite, by = "film") ) ``` <PRE class="fansi fansi-output"><CODE>Error in auto_copy(x, y, copy = copy, indexes = if (auto_index) list(by$y)) : `x` and `y` must share the same src. <span style='color: #00BBBB;'>ℹ</span> set `copy` = TRUE (may be slow). </CODE></PRE> ] .pull-right[ ## Not too bad ```r academy %>% left_join(pixar_films_sqlite, by = "film", copy = TRUE) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 7]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Fe… Award… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> Toy Story Original Sc… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Scr… Ineli… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 4</span> Toy Story Original Sc… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 5</span> Toy Story Original So… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won S… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Fe… Award… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Scr… Ineli… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Sc… Nomin… 2 1998-11-25 95 G <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Fe… Award… 3 1999-11-24 92 G <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Copying is expensive! A temporary table is created on the LHS database. If the RHS comes from a different database, results are temporarily loaded into the local session! ```r academy %>% left_join(pixar_films_sqlite, by = "film", copy = TRUE) %>% show_query() ``` ``` <SQL> SELECT "LHS"."film" AS "film", "award_type", "status", "number", "release_date", "run_time", "film_rating" FROM "academy" AS "LHS" LEFT JOIN "dbplyr_002" AS "RHS" ON ("LHS"."film" = "RHS"."film") ``` --- # Joining data frames with lazy tables The result is a data frame too. .pull-left[ ## Bad ```r try( pixarfilms::academy %>% left_join(pixar_films, by = "film") ) ``` <PRE class="fansi fansi-output"><CODE>Error in auto_copy(x, y, copy = copy) : `x` and `y` must share the same src. <span style='color: #00BBBB;'>ℹ</span> set `copy` = TRUE (may be slow). </CODE></PRE> ] .pull-right[ ## Could be worse ```r pixarfilms::academy %>% left_join(pixar_films, by = "film", copy = TRUE) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 80 × 7</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Fe… Award… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> Toy Story Original Sc… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Scr… Ineli… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 4</span> Toy Story Original Sc… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 5</span> Toy Story Original So… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won S… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Fe… Award… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Scr… Ineli… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Sc… Nomin… 2 1998-11-25 95 G <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Fe… Award… 3 1999-11-24 92 G <span style='color: #949494;'># … with 70 more rows</span> </CODE></PRE> ] --- # DuckDB: register data frames as database tables Temporarily use a local data frame as a table. Also works for Arrow datasets via `duckdb::duckdb_register_arrow()`. .pull-left[ ## Register and access ```r duckdb::duckdb_register( con_duckdb, "academy_small", pixarfilms::academy[1:3, ] ) academy_small <- tbl(con_duckdb, "academy_small") academy_small ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<academy_small> [?? x 3]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> Toy Story Animated Feature Award not yet introduced <span style='color: #BCBCBC;'>2</span> Toy Story Original Screenplay Nominated <span style='color: #BCBCBC;'>3</span> Toy Story Adapted Screenplay Ineligible </CODE></PRE> ] .pull-right[ ## Use ```r academy_small %>% left_join(pixar_films, by = "film") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 7]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> Toy Story Adapted Screenp… Ineli… 1 1995-11-22 81 G <span style='color: #BCBCBC;'>2</span> Toy Story Original Screen… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'>3</span> Toy Story Animated Feature Award… 1 1995-11-22 81 G </CODE></PRE> ] --- # DuckDB: Performance comparison Baseline: Data frames. .pull-left[ ```r nrow(nycflights13::flights) ``` ``` [1] 336776 ``` ] .pull-right[ ```r system.time( nycflights13::flights %>% count(year, month, day) ) ``` ``` user system elapsed 0.010 0.000 0.011 ``` ] --- # DuckDB: Performance comparison With registration. .pull-left[ ```r system.time(duckdb::duckdb_register( con_duckdb, "flights", nycflights13::flights )) ``` ``` user system elapsed 0.003 0.000 0.003 ``` ```r flights_register <- tbl(con_duckdb, "flights") flights_register %>% count() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 1]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> <span style='text-decoration: underline;'>336</span>776 </CODE></PRE> ] .pull-right[ ```r system.time( flights_register %>% count(year, month, day) %>% collect() ) ``` ``` user system elapsed 0.032 0.003 0.038 ``` ] --- # DuckDB: Performance comparison With copy. .pull-left[ ```r system.time( flights_copy <- copy_to(con_duckdb, nycflights13::flights) ) ``` ``` user system elapsed 0.090 0.014 0.116 ``` ```r flights_copy %>% count() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 1]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> <span style='text-decoration: underline;'>336</span>776 </CODE></PRE> ] .pull-right[ ```r system.time( flights_copy %>% count(year, month, day) %>% collect() ) ``` ``` user system elapsed 0.008 0.001 0.007 ``` ] --- # ETL, revisited Insert a second table into our database. ```r db_path <- fs::path_abs("pixar.duckdb") con <- DBI::dbConnect(duckdb::duckdb(dbdir = db_path)) DBI::dbWriteTable(con, "academy", pixarfilms::academy, overwrite = TRUE) DBI::dbExecute(con, "CREATE UNIQUE INDEX academy_pk ON academy (film, award_type)") ``` ``` [1] 0 ``` ```r DBI::dbExecute(con, "CREATE INDEX academy_fk ON academy (film)") ``` ``` [1] 0 ``` ```r DBI::dbDisconnect(con) ``` --- background-image: url("data:image/png;base64,#images/21-frame.webp") background-size: 40% background-position: 100% 100% # Joins: Exercises 1 .pull-left[ 1. How many rows does the join between `academy` and `pixar_films` contain? Try to find out without loading all the data into memory. Explain. 2. Which films are not yet listed in the `academy` table? What does the resulting SQL query look like? - Hint: Use `anti_join()` ] --- background-image: url("data:image/png;base64,#images/21-frame.webp") background-size: 40% background-position: 100% 100% # Joins: Exercises 2 .pull-left[ 3. Transform `academy` into a wide table so that there is at most one row per film. Join the resulting table with the `pixar_films` table. - Hint: Use `pivot_wider()`, `spread()`, `dcast()`, ... . You need to compute locally, because these functions don't work on the database. ] .pull-right[ 4. Plot a bar chart with the number of awards won and nominated per year. Compute as much as possible on the database. - Hint: "Long form" or "wide form"? ] --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - **The {dm} package** - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/22.webp") background-size: 40% background-position: 100% 100% # Data model basics .pull-left[ - Compound object for multiple tables - Features Script: `databases_22.R` ```r library(tidyverse) library(dm) ``` ] --- background-image: url("data:image/png;base64,#images/22-frame.webp") background-size: 40% background-position: 100% 100% # Data model basics .pull-left[ - Compound object for multiple tables - Features Script: `databases_22.R` ```r library(tidyverse) library(dm) ``` ] --- # Data model objects .pull-left[ Store multiple tables in an object. ```r pixar_dm <- dm_pixarfilms() pixar_dm ``` <PRE class="fansi fansi-output"><CODE><span style='color: #FFAFFF;'>──</span> <span style='color: #FFAFFF;'>Metadata</span> <span style='color: #FFAFFF;'>───────────────────────────────────────────────────────</span> Tables: `pixar_films`, `pixar_people`, `academy`, `box_office`, `genres`, `public_response` Columns: 23 Primary keys: 5 Foreign keys: 5 </CODE></PRE> ```r pixar_dm %>% dm_draw() ``` ] .pull-right[ <!-- Generated by graphviz version 2.40.1 (20161225.0304) --> <!-- Title: %0 Pages: 1 --> <svg width="315pt" height="330pt" viewBox="-100.00 0.00 315.00 330.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"> <g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 326)"> <title>%0</title> <g id="a_graph0"><a xlink:title="Data Model"> <polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-326 211,-326 211,4 -4,4"/> </a> </g> <!-- academy --> <g id="node1" class="node"> <title>academy</title> <polygon fill="#ed7d31" stroke="transparent" points="1.5,-301 1.5,-321 100.5,-321 100.5,-301 1.5,-301"/> <text text-anchor="start" x="26.1255" y="-306.4" font-family="Times,serif" font-size="14.00" fill="#ffffff">academy</text> <polygon fill="#fbe5d5" stroke="transparent" points="1.5,-281 1.5,-301 100.5,-301 100.5,-281 1.5,-281"/> <text text-anchor="start" x="3.5" y="-286.4" font-family="Times,serif" font-size="14.00" fill="#444444">film</text> <polygon fill="#fbe5d5" stroke="transparent" points="1.5,-261 1.5,-281 100.5,-281 100.5,-261 1.5,-261"/> <text text-anchor="start" x="3.183" y="-267.4" font-family="Times,serif" text-decoration="underline" font-size="14.00" fill="#444444">film, award_type</text> <polygon fill="none" stroke="#9e5320" stroke-opacity="0.666667" points="0,-260 0,-322 101,-322 101,-260 0,-260"/> </g> <!-- pixar_films --> <g id="node4" class="node"> <title>pixar_films</title> <polygon fill="#5b9bd5" stroke="transparent" points="138,-161 138,-181 206,-181 206,-161 138,-161"/> <text text-anchor="start" x="139.7286" y="-166.4" font-family="Times,serif" font-size="14.00" fill="#ffffff">pixar_films</text> <polygon fill="#deebf6" stroke="transparent" points="138,-141 138,-161 206,-161 206,-141 138,-141"/> <text text-anchor="start" x="140" y="-147.4" font-family="Times,serif" text-decoration="underline" font-size="14.00" fill="#444444">film</text> <polygon fill="none" stroke="#3c678e" stroke-opacity="0.666667" points="137,-140 137,-182 207,-182 207,-140 137,-140"/> </g> <!-- academy->pixar_films --> <g id="edge2" class="edge"> <title>academy:film->pixar_films:film</title> <path fill="none" stroke="#555555" d="M100.5,-291C161.0155,-291 84.073,-167.4399 127.9432,-152.4775"/> <polygon fill="#555555" stroke="#555555" points="128.615,-155.9165 138,-151 127.5974,-148.9909 128.615,-155.9165"/> </g> <!-- box_office --> <g id="node2" class="node"> <title>box_office</title> <polygon fill="#ed7d31" stroke="transparent" points="18.5,-221 18.5,-241 82.5,-241 82.5,-221 18.5,-221"/> <text text-anchor="start" x="20.1795" y="-226.4" font-family="Times,serif" font-size="14.00" fill="#ffffff">box_office</text> <polygon fill="#fbe5d5" stroke="transparent" points="18.5,-201 18.5,-221 82.5,-221 82.5,-201 18.5,-201"/> <text text-anchor="start" x="20.5" y="-207.4" font-family="Times,serif" text-decoration="underline" font-size="14.00" fill="#444444">film</text> <polygon fill="none" stroke="#9e5320" stroke-opacity="0.666667" points="17.5,-200 17.5,-242 83.5,-242 83.5,-200 17.5,-200"/> </g> <!-- box_office->pixar_films --> <g id="edge3" class="edge"> <title>box_office:film->pixar_films:film</title> <path fill="none" stroke="#555555" d="M82.5,-211C114.9945,-211 104.6973,-162.9888 128.1654,-152.8615"/> <polygon fill="#555555" stroke="#555555" points="128.8254,-156.2988 138,-151 127.5235,-149.4209 128.8254,-156.2988"/> </g> <!-- genres --> <g id="node3" class="node"> <title>genres</title> <polygon fill="#ed7d31" stroke="transparent" points="18.5,-161 18.5,-181 83.5,-181 83.5,-161 18.5,-161"/> <text text-anchor="start" x="32.7328" y="-166.4" font-family="Times,serif" font-size="14.00" fill="#ffffff">genres</text> <polygon fill="#fbe5d5" stroke="transparent" points="18.5,-141 18.5,-161 83.5,-161 83.5,-141 18.5,-141"/> <text text-anchor="start" x="20.5" y="-146.4" font-family="Times,serif" font-size="14.00" fill="#444444">film</text> <polygon fill="#fbe5d5" stroke="transparent" points="18.5,-121 18.5,-141 83.5,-141 83.5,-121 18.5,-121"/> <text text-anchor="start" x="20.2889" y="-127.4" font-family="Times,serif" text-decoration="underline" font-size="14.00" fill="#444444">film, genre</text> <polygon fill="none" stroke="#9e5320" stroke-opacity="0.666667" points="17,-120 17,-182 84,-182 84,-120 17,-120"/> </g> <!-- genres->pixar_films --> <g id="edge4" class="edge"> <title>genres:film->pixar_films:film</title> <path fill="none" stroke="#555555" d="M83.5,-151C103.9375,-151 111.4419,-151 127.8378,-151"/> <polygon fill="#555555" stroke="#555555" points="128,-154.5001 138,-151 128,-147.5001 128,-154.5001"/> </g> <!-- pixar_people --> <g id="node5" class="node"> <title>pixar_people</title> <polygon fill="#70ad47" stroke="transparent" points="12.5,-81 12.5,-101 89.5,-101 89.5,-81 12.5,-81"/> <text text-anchor="start" x="14.4572" y="-86.4" font-family="Times,serif" font-size="14.00" fill="#ffffff">pixar_people</text> <polygon fill="#e2eeda" stroke="transparent" points="12.5,-61 12.5,-81 89.5,-81 89.5,-61 12.5,-61"/> <text text-anchor="start" x="14.5" y="-66.4" font-family="Times,serif" font-size="14.00" fill="#444444">film</text> <polygon fill="none" stroke="#4a732f" stroke-opacity="0.666667" points="11,-60 11,-102 90,-102 90,-60 11,-60"/> </g> <!-- pixar_people->pixar_films --> <g id="edge1" class="edge"> <title>pixar_people:film->pixar_films:film</title> <path fill="none" stroke="#555555" d="M89.5,-71C127.1813,-71 102.2491,-136.7031 128.1904,-149.0225"/> <polygon fill="#555555" stroke="#555555" points="127.5055,-152.4547 138,-151 128.8889,-145.5928 127.5055,-152.4547"/> </g> <!-- public_response --> <g id="node6" class="node"> <title>public_response</title> <polygon fill="#ed7d31" stroke="transparent" points="3.5,-21 3.5,-41 97.5,-41 97.5,-21 3.5,-21"/> <text text-anchor="start" x="5.0126" y="-26.4" font-family="Times,serif" font-size="14.00" fill="#ffffff">public_response</text> <polygon fill="#fbe5d5" stroke="transparent" points="3.5,-1 3.5,-21 97.5,-21 97.5,-1 3.5,-1"/> <text text-anchor="start" x="5.5" y="-7.4" font-family="Times,serif" text-decoration="underline" font-size="14.00" fill="#444444">film</text> <polygon fill="none" stroke="#9e5320" stroke-opacity="0.666667" points="2.5,0 2.5,-42 98.5,-42 98.5,0 2.5,0"/> </g> <!-- public_response->pixar_films --> <g id="edge5" class="edge"> <title>public_response:film->pixar_films:film</title> <path fill="none" stroke="#555555" d="M97.5,-11C158.3517,-11 83.4456,-134.5601 127.8579,-149.5225"/> <polygon fill="#555555" stroke="#555555" points="127.5999,-153.0217 138,-151 128.6091,-146.0949 127.5999,-153.0217"/> </g> </g> </svg> ] --- # Data model objects Use like a named list. ```r names(pixar_dm) ``` ``` [1] "pixar_films" "pixar_people" "academy" "box_office" "genres" "public_response" ``` .pull-left[ ```r pixar_dm$pixar_films ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] .pull-right[ ```r pixar_dm$academy ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 80 × 3</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Feature Award not yet introduced <span style='color: #BCBCBC;'> 2</span> Toy Story Original Screenplay Nominated <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Screenplay Ineligible <span style='color: #BCBCBC;'> 4</span> Toy Story Original Score Nominated <span style='color: #BCBCBC;'> 5</span> Toy Story Original Song Nominated <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won Special Achievement <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Feature Award not yet introduced <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Screenplay Ineligible <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Score Nominated <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Feature Award not yet introduced <span style='color: #949494;'># … with 70 more rows</span> </CODE></PRE> ] --- # Showcase: wrapping all tables in a data model One of the many operations supported by {dm}. .pull-left[ ```r pixar_films_wrapped <- pixar_dm %>% dm_wrap_tbl(pixar_films) %>% pull_tbl(pixar_films) pixar_films_wrapped ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 10</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='font-weight: bold;'>pixar_p…</span> <span style='font-weight: bold;'>academy</span> <span style='font-weight: bold;'>box_off…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><nested></span> <span style='color: #949494; font-style: italic;'><nested></span> <span style='color: #949494; font-style: italic;'><nested></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy … 1995-11-22 81 G <tibble> <tibble> <tibble> <span style='color: #BCBCBC;'> 2</span> 2 A Bu… 1998-11-25 95 G <tibble> <tibble> <tibble> <span style='color: #BCBCBC;'> 3</span> 3 Toy … 1999-11-24 92 G <tibble> <tibble> <tibble> <span style='color: #BCBCBC;'> 4</span> 4 Mons… 2001-11-02 92 G <tibble> <tibble> <tibble> <span style='color: #BCBCBC;'> 5</span> 5 Find… 2003-05-30 100 G <tibble> <tibble> <tibble> <span style='color: #BCBCBC;'> 6</span> 6 The … 2004-11-05 115 PG <tibble> <tibble> <tibble> <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <tibble> <tibble> <tibble> <span style='color: #BCBCBC;'> 8</span> 8 Rata… 2007-06-29 111 G <tibble> <tibble> <tibble> <span style='color: #BCBCBC;'> 9</span> 9 WALL… 2008-06-27 98 G <tibble> <tibble> <tibble> <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <tibble> <tibble> <tibble> <span style='color: #949494;'># … with 17 more rows, and 2 more variables: </span><span style='color: #949494; font-weight: bold;'>genres</span><span style='color: #949494;'> <nested>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>public_response</span><span style='color: #949494;'> <nested></span> </CODE></PRE> ] .pull-right[ ```r pixar_films_wrapped$academy[1:2] ``` <PRE class="fansi fansi-output"><CODE>[[1]] <span style='color: #949494;'># A tibble: 6 × 2</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> Animated Feature Award not yet introduced <span style='color: #BCBCBC;'>2</span> Original Screenplay Nominated <span style='color: #BCBCBC;'>3</span> Adapted Screenplay Ineligible <span style='color: #BCBCBC;'>4</span> Original Score Nominated <span style='color: #BCBCBC;'>5</span> Original Song Nominated <span style='color: #BCBCBC;'>6</span> Other Won Special Achievement [[2]] <span style='color: #949494;'># A tibble: 3 × 2</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> Animated Feature Award not yet introduced <span style='color: #BCBCBC;'>2</span> Adapted Screenplay Ineligible <span style='color: #BCBCBC;'>3</span> Original Score Nominated </CODE></PRE> ] --- background-image: url("data:image/png;base64,#images/22-frame.webp") background-size: 40% background-position: 100% 100% # Data model basics: Exercises .pull-left[ - Experiment! ] --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - **A bit of theory** - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/23.webp") background-size: 40% background-position: 100% 100% # Data models .pull-left[ - Keys, relationships, constraints - Zooming Script: `databases_23.R` ```r library(tidyverse) library(dm) ``` ] --- background-image: url("data:image/png;base64,#images/23-frame.webp") background-size: 40% background-position: 100% 100% # Data models .pull-left[ - Keys, relationships, constraints - Zooming Script: `databases_23.R` ```r library(tidyverse) library(dm) ``` ] --- # Primary keys Column(s) that uniquely identify rows in a table. .pull-left[ ```r any(duplicated(pixar_dm$pixar_films$film)) ``` ``` [1] FALSE ``` ```r check_key(pixar_dm$pixar_films, film) ``` ] .pull-right[ ```r any(duplicated(pixar_dm$academy[c("film", "award_type")])) ``` ``` [1] FALSE ``` ```r check_key(pixar_dm$academy, film, award_type) try( check_key(pixar_dm$academy, film) ) ``` ``` Error in abort_not_unique_key(as_label(data_q), orig_names) : (`film`) not a unique key of `pixar_dm$academy`. ``` ] --- # Foreign keys Column(s) that point to a primary key in another table. ```r all(pixar_dm$academy$film %in% pixar_dm$pixar_films$film) ``` ``` [1] TRUE ``` ```r check_subset(pixar_dm$academy, film, pixar_dm$pixar_films, film) try( check_subset(pixar_dm$pixar_films, film, pixar_dm$academy, film) ) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 4 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> 24 Luca 2021-06-18 151 N/A <span style='color: #BCBCBC;'>2</span> 25 Turning Red 2022-03-11 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'>3</span> 26 Lightyear 2022-06-17 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'>4</span> 27 <span style='color: #BB0000;'>NA</span> 2023-06-16 155 Not Rated Error in abort_not_subset_of(as_label(t1q), col_names_1, as_label(t2q), : Column (`film`) of table `pixar_dm$pixar_films` contains values (see examples above) that are not present in column (`film`) of table `pixar_dm$academy`. </CODE></PRE> --- # Constraints Properties of primary and foreign keys can be checked. ```r pixar_dm %>% dm_examine_constraints() ``` <PRE class="fansi fansi-message"><CODE><span style='color: #BBBB00;'>!</span> Unsatisfied constraints: </CODE></PRE><PRE class="fansi fansi-output"><CODE><span style='color: #BB0000;'>•</span> Table `pixar_films`: primary key `film`: has 1 missing values </CODE></PRE> ```r dm_pixarfilms(consistent = TRUE) %>% dm_examine_constraints() ``` <PRE class="fansi fansi-message"><CODE><span style='color: #00BBBB;'>ℹ</span> All constraints satisfied. </CODE></PRE> --- # Constraints An example from another dataset. ```r dm_nycflights13() %>% dm_examine_constraints() ``` <PRE class="fansi fansi-message"><CODE><span style='color: #BBBB00;'>!</span> Unsatisfied constraints: </CODE></PRE><PRE class="fansi fansi-output"><CODE><span style='color: #BB0000;'>•</span> Table `flights`: foreign key `tailnum` into table `planes`: values of `flights$tailnum` not in `planes$tailnum`: N725MQ (6), N537MQ (5), N722MQ (5), N730MQ (5), N736MQ (5), … </CODE></PRE> --- # Zooming Focusing on one table in a dm object. Allows applying data transformations on that table inside a dm object. .pull-left[ ```r pixar_dm %>% dm_zoom_to(academy) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Zoomed table: academy</span> <span style='color: #949494;'># A tibble: 80 × 3</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Feature Award not yet introduced <span style='color: #BCBCBC;'> 2</span> Toy Story Original Screenplay Nominated <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Screenplay Ineligible <span style='color: #BCBCBC;'> 4</span> Toy Story Original Score Nominated <span style='color: #BCBCBC;'> 5</span> Toy Story Original Song Nominated <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won Special Achievement <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Feature Award not yet introduced <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Screenplay Ineligible <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Score Nominated <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Feature Award not yet introduced <span style='color: #949494;'># … with 70 more rows</span> </CODE></PRE> ] .pull-right[ ```r pixar_dm %>% dm_zoom_to(academy) %>% left_join(pixar_films, select = c(film, release_date)) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Zoomed table: academy</span> <span style='color: #949494;'># A tibble: 80 × 4</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>release_d…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Feature Award not yet intro… 1995-11-22 <span style='color: #BCBCBC;'> 2</span> Toy Story Original Screenplay Nominated 1995-11-22 <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Screenplay Ineligible 1995-11-22 <span style='color: #BCBCBC;'> 4</span> Toy Story Original Score Nominated 1995-11-22 <span style='color: #BCBCBC;'> 5</span> Toy Story Original Song Nominated 1995-11-22 <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won Special Achieve… 1995-11-22 <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Feature Award not yet intro… 1998-11-25 <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Screenplay Ineligible 1998-11-25 <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Score Nominated 1998-11-25 <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Feature Award not yet intro… 1999-11-24 <span style='color: #949494;'># … with 70 more rows</span> </CODE></PRE> ] --- # Flattening Join a table to all related tables. ```r pixar_dm %>% dm_flatten_to_tbl(academy) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 80 × 7</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Feature Award not yet introduced 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> Toy Story Original Screenplay Nominated 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Screenplay Ineligible 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 4</span> Toy Story Original Score Nominated 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 5</span> Toy Story Original Song Nominated 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won Special Achievement 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Feature Award not yet introduced 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Screenplay Ineligible 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Score Nominated 2 1998-11-25 95 G <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Feature Award not yet introduced 3 1999-11-24 92 G <span style='color: #949494;'># … with 70 more rows</span> </CODE></PRE> --- # Flattening A larger example with a different dataset. ```r dm_nycflights13() %>% dm_select(weather, -year, -month, -day, -hour) %>% dm_flatten_to_tbl(flights) ``` ``` Renaming ambiguous columns: %>% dm_rename(flights, flights.year = year) %>% dm_rename(airlines, airlines.name = name) %>% dm_rename(airports, airports.name = name) %>% dm_rename(planes, planes.year = year) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 1,761 × 44</span> <span style='font-weight: bold;'>flights.year</span> <span style='font-weight: bold;'>month</span> <span style='font-weight: bold;'>day</span> <span style='font-weight: bold;'>dep_time</span> <span style='font-weight: bold;'>sched_de…</span> <span style='font-weight: bold;'>dep_d…</span> <span style='font-weight: bold;'>arr_t…</span> <span style='font-weight: bold;'>sched…</span> <span style='font-weight: bold;'>arr_d…</span> <span style='font-weight: bold;'>carri…</span> <span style='font-weight: bold;'>flight</span> <span style='font-weight: bold;'>tailn…</span> <span style='font-weight: bold;'>origin</span> <span style='font-weight: bold;'>dest</span> <span style='font-weight: bold;'>air_t…</span> <span style='font-weight: bold;'>dista…</span> <span style='font-weight: bold;'>hour</span> <span style='font-weight: bold;'>minute</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> <span style='text-decoration: underline;'>2</span>013 1 10 3 <span style='text-decoration: underline;'>2</span>359 4 426 437 -<span style='color: #BB0000;'>11</span> B6 727 N571JB JFK BQN 183 <span style='text-decoration: underline;'>1</span>576 23 59 <span style='color: #BCBCBC;'> 2</span> <span style='text-decoration: underline;'>2</span>013 1 10 16 <span style='text-decoration: underline;'>2</span>359 17 447 444 3 B6 739 N564JB JFK PSE 191 <span style='text-decoration: underline;'>1</span>617 23 59 <span style='color: #BCBCBC;'> 3</span> <span style='text-decoration: underline;'>2</span>013 1 10 450 500 -<span style='color: #BB0000;'>10</span> 634 648 -<span style='color: #BB0000;'>14</span> US <span style='text-decoration: underline;'>1</span>117 N171US EWR CLT 78 529 5 0 <span style='color: #BCBCBC;'> 4</span> <span style='text-decoration: underline;'>2</span>013 1 10 520 525 -<span style='color: #BB0000;'>5</span> 813 820 -<span style='color: #BB0000;'>7</span> UA <span style='text-decoration: underline;'>1</span>018 N35204 EWR IAH 215 <span style='text-decoration: underline;'>1</span>400 5 25 <span style='color: #BCBCBC;'> 5</span> <span style='text-decoration: underline;'>2</span>013 1 10 530 530 0 824 829 -<span style='color: #BB0000;'>5</span> UA 404 N815UA LGA IAH 210 <span style='text-decoration: underline;'>1</span>416 5 30 <span style='color: #BCBCBC;'> 6</span> <span style='text-decoration: underline;'>2</span>013 1 10 531 540 -<span style='color: #BB0000;'>9</span> 832 850 -<span style='color: #BB0000;'>18</span> AA <span style='text-decoration: underline;'>1</span>141 N5EAAA JFK MIA 149 <span style='text-decoration: underline;'>1</span>089 5 40 <span style='color: #BCBCBC;'> 7</span> <span style='text-decoration: underline;'>2</span>013 1 10 535 540 -<span style='color: #BB0000;'>5</span> <span style='text-decoration: underline;'>1</span>015 <span style='text-decoration: underline;'>1</span>017 -<span style='color: #BB0000;'>2</span> B6 725 N784JB JFK BQN 191 <span style='text-decoration: underline;'>1</span>576 5 40 <span style='color: #BCBCBC;'> 8</span> <span style='text-decoration: underline;'>2</span>013 1 10 546 600 -<span style='color: #BB0000;'>14</span> 645 709 -<span style='color: #BB0000;'>24</span> B6 380 N337JB EWR BOS 39 200 6 0 <span style='color: #BCBCBC;'> 9</span> <span style='text-decoration: underline;'>2</span>013 1 10 549 600 -<span style='color: #BB0000;'>11</span> 652 724 -<span style='color: #BB0000;'>32</span> EV <span style='text-decoration: underline;'>6</span>055 N19554 LGA IAD 48 229 6 0 <span style='color: #BCBCBC;'>10</span> <span style='text-decoration: underline;'>2</span>013 1 10 550 600 -<span style='color: #BB0000;'>10</span> 649 703 -<span style='color: #BB0000;'>14</span> US <span style='text-decoration: underline;'>2</span>114 N740UW LGA BOS 36 184 6 0 <span style='color: #949494;'># … with 1,751 more rows, and 26 more variables: </span><span style='color: #949494; font-weight: bold;'>time_hour</span><span style='color: #949494;'> <dttm>, </span><span style='color: #949494; font-weight: bold;'>airlines.name</span><span style='color: #949494;'> <chr>, </span><span style='color: #949494; font-weight: bold;'>airports.name</span><span style='color: #949494;'> <chr>, </span><span style='color: #949494; font-weight: bold;'>lat</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>lon</span><span style='color: #949494;'> <dbl>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>alt</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>tz</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>dst</span><span style='color: #949494;'> <chr>, </span><span style='color: #949494; font-weight: bold;'>tzone</span><span style='color: #949494;'> <chr>, </span><span style='color: #949494; font-weight: bold;'>planes.year</span><span style='color: #949494;'> <int>, </span><span style='color: #949494; font-weight: bold;'>type</span><span style='color: #949494;'> <chr>, </span><span style='color: #949494; font-weight: bold;'>manufacturer</span><span style='color: #949494;'> <chr>, </span><span style='color: #949494; font-weight: bold;'>model</span><span style='color: #949494;'> <chr>, </span><span style='color: #949494; font-weight: bold;'>engines</span><span style='color: #949494;'> <int>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>seats</span><span style='color: #949494;'> <int>, </span><span style='color: #949494; font-weight: bold;'>speed</span><span style='color: #949494;'> <int>, </span><span style='color: #949494; font-weight: bold;'>engine</span><span style='color: #949494;'> <chr>, </span><span style='color: #949494; font-weight: bold;'>temp</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>dewp</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>humid</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>wind_dir</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>wind_speed</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>wind_gust</span><span style='color: #949494;'> <dbl>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>precip</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>pressure</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>visib</span><span style='color: #949494;'> <dbl></span> </CODE></PRE> --- background-image: url("data:image/png;base64,#images/23-frame.webp") background-size: 40% background-position: 100% 100% # Data models: Exercises .pull-left[ - Experiment! ] --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - **Playing the whole game** ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/24.webp") background-size: 40% background-position: 100% 100% # The whole game .pull-left[ - Build a local data model - Copy it to the database - Consume it Script: `databases_24.R` ```r library(tidyverse) library(dm) ``` ] --- background-image: url("data:image/png;base64,#images/24-frame.webp") background-size: 40% background-position: 100% 100% # The whole game .pull-left[ - Build a local data model - Copy it to the database - Consume it Script: `databases_24.R` ```r library(tidyverse) library(dm) ``` ] --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Define transformed table - Fix type of `number` column - Extract `franchise` and `sequel` columns ```r pixar_films_clean ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 26 × 7</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>franchise</span> <span style='font-weight: bold;'>sequel</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story Toy Story 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life A Bug's Life <span style='color: #BB0000;'>NA</span> 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. Monsters, Inc. <span style='color: #BB0000;'>NA</span> 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo Finding Nemo <span style='color: #BB0000;'>NA</span> 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles The Incredibles <span style='color: #BB0000;'>NA</span> 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars Cars 1 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille Ratatouille <span style='color: #BB0000;'>NA</span> 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E WALL-E <span style='color: #BB0000;'>NA</span> 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up Up <span style='color: #BB0000;'>NA</span> 2009-05-29 96 PG <span style='color: #949494;'># … with 16 more rows</span> </CODE></PRE> --- # Define dm object Use `dm()` to create a dm object, pass tables (data frames or lazy tables). ```r base_dm <- dm( pixar_films = pixar_films_clean, academy = pixarfilms::academy, box_office = pixarfilms::box_office, ) base_dm ``` <PRE class="fansi fansi-output"><CODE><span style='color: #FFAFFF;'>──</span> <span style='color: #FFAFFF;'>Metadata</span> <span style='color: #FFAFFF;'>───────────────────────────────────────────────────────</span> Tables: `pixar_films`, `academy`, `box_office` Columns: 15 Primary keys: 0 Foreign keys: 0 </CODE></PRE> --- # Add keys Using `dm_add_pk()` and `dm_add_fk()`. .pull-left[ ```r full_dm <- base_dm %>% dm_add_pk(pixar_films, film) %>% dm_add_pk(box_office, film) %>% dm_add_fk(academy, film, pixar_films) %>% dm_add_fk(box_office, film, pixar_films) full_dm ``` <PRE class="fansi fansi-output"><CODE><span style='color: #FFAFFF;'>──</span> <span style='color: #FFAFFF;'>Metadata</span> <span style='color: #FFAFFF;'>───────────────────────────────────────────────────────</span> Tables: `pixar_films`, `academy`, `box_office` Columns: 15 Primary keys: 2 Foreign keys: 2 </CODE></PRE> ```r full_dm %>% dm_draw(view_type = "all") ``` ] .pull-right[ <!-- Generated by graphviz version 2.40.1 (20161225.0304) --> <!-- Title: %0 Pages: 1 --> <svg width="252pt" height="230pt" viewBox="0.00 0.00 252.00 230.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"> <g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 226)"> <title>%0</title> <g id="a_graph0"><a xlink:title="Data Model"> <polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-226 248,-226 248,4 -4,4"/> </a> </g> <!-- academy --> <g id="node1" class="node"> <title>academy</title> <polygon fill="#efebdd" stroke="transparent" points="32.5,-201 32.5,-221 101.5,-221 101.5,-201 32.5,-201"/> <text text-anchor="start" x="42.1255" y="-206.4" font-family="Times,serif" font-size="14.00" fill="#000000">academy</text> <polygon fill="#ffffff" stroke="transparent" points="32.5,-181 32.5,-201 101.5,-201 101.5,-181 32.5,-181"/> <text text-anchor="start" x="34.5" y="-186.4" font-family="Times,serif" font-size="14.00" fill="#444444">film</text> <polygon fill="#ffffff" stroke="transparent" points="32.5,-161 32.5,-181 101.5,-181 101.5,-161 32.5,-161"/> <text text-anchor="start" x="34.3492" y="-166.4" font-family="Times,serif" font-size="14.00" fill="#444444">award_type</text> <polygon fill="#ffffff" stroke="transparent" points="32.5,-141 32.5,-161 101.5,-161 101.5,-141 32.5,-141"/> <text text-anchor="start" x="34.5" y="-146.4" font-family="Times,serif" font-size="14.00" fill="#444444">status</text> <polygon fill="none" stroke="#555555" points="31,-140 31,-222 102,-222 102,-140 31,-140"/> </g> <!-- pixar_films --> <g id="node3" class="node"> <title>pixar_films</title> <polygon fill="#efebdd" stroke="transparent" points="170.5,-171 170.5,-191 243.5,-191 243.5,-171 170.5,-171"/> <text text-anchor="start" x="174.7286" y="-176.4" font-family="Times,serif" font-size="14.00" fill="#000000">pixar_films</text> <polygon fill="#ffffff" stroke="transparent" points="170.5,-151 170.5,-171 243.5,-171 243.5,-151 170.5,-151"/> <text text-anchor="start" x="172.5" y="-156.4" font-family="Times,serif" font-size="14.00" fill="#444444">number</text> <polygon fill="#ffffff" stroke="transparent" points="170.5,-131 170.5,-151 243.5,-151 243.5,-131 170.5,-131"/> <text text-anchor="start" x="172.5" y="-137.4" font-family="Times,serif" text-decoration="underline" font-size="14.00" fill="#444444">film</text> <polygon fill="#ffffff" stroke="transparent" points="170.5,-111 170.5,-131 243.5,-131 243.5,-111 170.5,-111"/> <text text-anchor="start" x="172.5" y="-116.4" font-family="Times,serif" font-size="14.00" fill="#444444">franchise</text> <polygon fill="#ffffff" stroke="transparent" points="170.5,-91 170.5,-111 243.5,-111 243.5,-91 170.5,-91"/> <text text-anchor="start" x="172.5" y="-96.4" font-family="Times,serif" font-size="14.00" fill="#444444">sequel</text> <polygon fill="#ffffff" stroke="transparent" points="170.5,-71 170.5,-91 243.5,-91 243.5,-71 170.5,-71"/> <text text-anchor="start" x="172.413" y="-76.4" font-family="Times,serif" font-size="14.00" fill="#444444">release_date</text> <polygon fill="#ffffff" stroke="transparent" points="170.5,-51 170.5,-71 243.5,-71 243.5,-51 170.5,-51"/> <text text-anchor="start" x="172.5" y="-56.4" font-family="Times,serif" font-size="14.00" fill="#444444">run_time</text> <polygon fill="#ffffff" stroke="transparent" points="170.5,-31 170.5,-51 243.5,-51 243.5,-31 170.5,-31"/> <text text-anchor="start" x="172.5" y="-36.4" font-family="Times,serif" font-size="14.00" fill="#444444">film_rating</text> <polygon fill="none" stroke="#555555" points="169,-30 169,-192 244,-192 244,-30 169,-30"/> </g> <!-- academy->pixar_films --> <g id="edge1" class="edge"> <title>academy:film->pixar_films:film</title> <path fill="none" stroke="#555555" d="M101.5,-191C135.5254,-191 133.5378,-150.6405 160.2481,-142.4425"/> <polygon fill="#555555" stroke="#555555" points="161.0852,-145.8593 170.5,-141 160.1099,-138.9275 161.0852,-145.8593"/> </g> <!-- box_office --> <g id="node2" class="node"> <title>box_office</title> <polygon fill="#efebdd" stroke="transparent" points="1.5,-101 1.5,-121 132.5,-121 132.5,-101 1.5,-101"/> <text text-anchor="start" x="36.6795" y="-106.4" font-family="Times,serif" font-size="14.00" fill="#000000">box_office</text> <polygon fill="#ffffff" stroke="transparent" points="1.5,-81 1.5,-101 132.5,-101 132.5,-81 1.5,-81"/> <text text-anchor="start" x="3.5" y="-87.4" font-family="Times,serif" text-decoration="underline" font-size="14.00" fill="#444444">film</text> <polygon fill="#ffffff" stroke="transparent" points="1.5,-61 1.5,-81 132.5,-81 132.5,-61 1.5,-61"/> <text text-anchor="start" x="3.5" y="-66.4" font-family="Times,serif" font-size="14.00" fill="#444444">budget</text> <polygon fill="#ffffff" stroke="transparent" points="1.5,-41 1.5,-61 132.5,-61 132.5,-41 1.5,-41"/> <text text-anchor="start" x="3.5" y="-46.4" font-family="Times,serif" font-size="14.00" fill="#444444">box_office_us_canada</text> <polygon fill="#ffffff" stroke="transparent" points="1.5,-21 1.5,-41 132.5,-41 132.5,-21 1.5,-21"/> <text text-anchor="start" x="3.5" y="-26.4" font-family="Times,serif" font-size="14.00" fill="#444444">box_office_other</text> <polygon fill="#ffffff" stroke="transparent" points="1.5,-1 1.5,-21 132.5,-21 132.5,-1 1.5,-1"/> <text text-anchor="start" x="3.2447" y="-6.4" font-family="Times,serif" font-size="14.00" fill="#444444">box_office_worldwide</text> <polygon fill="none" stroke="#555555" points="0,0 0,-122 133,-122 133,0 0,0"/> </g> <!-- box_office->pixar_films --> <g id="edge2" class="edge"> <title>box_office:film->pixar_films:film</title> <path fill="none" stroke="#555555" d="M132.5,-91C156.2685,-91 146.8719,-127.2579 160.7946,-138.022"/> <polygon fill="#555555" stroke="#555555" points="159.9132,-141.4126 170.5,-141 161.9666,-134.7205 159.9132,-141.4126"/> </g> </g> </svg> ] --- # ETL, revisited All tables and keys in a dm object can be copied in one swoop. ```r db_path <- fs::path_abs("pixar.duckdb") fs::file_delete(db_path) con <- DBI::dbConnect(duckdb::duckdb(dbdir = db_path)) pixar_dm_duckdb <- copy_dm_to(con, full_dm, temporary = FALSE) ``` ``` Warning: duckdb doesn't support foreign keys, these won't be set in the remote database but are preserved in the `dm` ``` ```r pixar_dm_duckdb ``` <PRE class="fansi fansi-output"><CODE><span style='color: #00BB00;'>──</span> <span style='color: #00BB00;'>Table source</span> <span style='color: #00BB00;'>───────────────────────────────────────────────────</span> src: duckdb_connection <span style='color: #FFAFFF;'>──</span> <span style='color: #FFAFFF;'>Metadata</span> <span style='color: #FFAFFF;'>───────────────────────────────────────────────────────</span> Tables: `pixar_films`, `academy`, `box_office` Columns: 15 Primary keys: 2 Foreign keys: 2 </CODE></PRE> --- # ETL, revisited The `temporary = FALSE` argument ensures that tables are permanent. ```r pixar_dm_duckdb$pixar_films ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<"pixar_films"> [?? x 7]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>franchise</span> <span style='font-weight: bold;'>sequel</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story Toy Story 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life A Bug's Life <span style='color: #BB0000;'>NA</span> 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. Monsters, Inc. <span style='color: #BB0000;'>NA</span> 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo Finding Nemo <span style='color: #BB0000;'>NA</span> 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles The Incredibles <span style='color: #BB0000;'>NA</span> 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars Cars 1 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille Ratatouille <span style='color: #BB0000;'>NA</span> 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E WALL-E <span style='color: #BB0000;'>NA</span> 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up Up <span style='color: #BB0000;'>NA</span> 2009-05-29 96 PG <span style='color: #949494;'># … with more rows</span> </CODE></PRE> --- # ETL, revisited Operations on the database dm work the same as for the local dm. ```r pixar_dm_duckdb %>% dm_flatten_to_tbl(academy) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 9]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>franchise</span> <span style='font-weight: bold;'>sequel</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Feature Award not yet introduced 1 Toy Story 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> Toy Story Original Screenplay Nominated 1 Toy Story 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Screenplay Ineligible 1 Toy Story 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 4</span> Toy Story Original Score Nominated 1 Toy Story 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 5</span> Toy Story Original Song Nominated 1 Toy Story 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won Special Achievement 1 Toy Story 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Feature Award not yet introduced 2 A Bug's Life <span style='color: #BB0000;'>NA</span> 1998-11-25 95 G <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Screenplay Ineligible 2 A Bug's Life <span style='color: #BB0000;'>NA</span> 1998-11-25 95 G <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Score Nominated 2 A Bug's Life <span style='color: #BB0000;'>NA</span> 1998-11-25 95 G <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Feature Award not yet introduced 3 Toy Story 2 1999-11-24 92 G <span style='color: #949494;'># … with more rows</span> </CODE></PRE> --- # Consume, revisited Keys are not (yet) loaded for DuckDB. This works better for SQL Server and Postgres. ```r DBI::dbDisconnect(con) con <- DBI::dbConnect(duckdb::duckdb(dbdir = db_path)) pixar_dm_learned <- dm_from_src(con) ``` ``` Keys could not be queried, use `learn_keys = FALSE` to mute this message. ``` ```r pixar_dm_learned ``` <PRE class="fansi fansi-output"><CODE><span style='color: #00BB00;'>──</span> <span style='color: #00BB00;'>Table source</span> <span style='color: #00BB00;'>───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</span> src: duckdb_connection <span style='color: #FFAFFF;'>──</span> <span style='color: #FFAFFF;'>Metadata</span> <span style='color: #FFAFFF;'>───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</span> Tables: `academy`, `box_office`, `pixar_films` Columns: 15 Primary keys: 0 Foreign keys: 0 </CODE></PRE> --- # Consume, revisited Implement a helper function for your data model. .pull-left[ ```r dm_pixarfilms_small <- function() { db_path <- fs::path_abs("pixar.duckdb") con <- DBI::dbConnect(duckdb::duckdb(dbdir = db_path)) table_names <- c("academy", "box_office", "pixar_films") dm_from_src( con, table_names = table_names, learn_keys = FALSE ) %>% dm_add_pk(pixar_films, film) %>% dm_add_pk(box_office, film) %>% dm_add_fk(academy, film, pixar_films) %>% dm_add_fk(box_office, film, pixar_films) } ``` ] .pull-right[ ```r dm_pixarfilms_small() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #00BB00;'>──</span> <span style='color: #00BB00;'>Table source</span> <span style='color: #00BB00;'>───────────────────────────────────────────────────</span> src: duckdb_connection <span style='color: #FFAFFF;'>──</span> <span style='color: #FFAFFF;'>Metadata</span> <span style='color: #FFAFFF;'>───────────────────────────────────────────────────────</span> Tables: `academy`, `box_office`, `pixar_films` Columns: 15 Primary keys: 2 Foreign keys: 2 </CODE></PRE> ] --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Consuming with dm Create a new derived table in the dm object. .pull-left[ ```r my_dm <- dm_pixarfilms_small() my_helper_dm <- my_dm %>% dm_zoom_to(academy) %>% filter(status == "Won") %>% count(film, name = "n_awards_won") %>% dm_insert_zoomed("academy_won") ``` ] .pull-right[ ```r my_helper_dm ``` <PRE class="fansi fansi-output"><CODE><span style='color: #00BB00;'>──</span> <span style='color: #00BB00;'>Table source</span> <span style='color: #00BB00;'>───────────────────────────────────────────────────</span> src: duckdb_connection <span style='color: #FFAFFF;'>──</span> <span style='color: #FFAFFF;'>Metadata</span> <span style='color: #FFAFFF;'>───────────────────────────────────────────────────────</span> Tables: `academy`, `box_office`, `pixar_films`, `academy_won` Columns: 17 Primary keys: 2 Foreign keys: 3 </CODE></PRE> ] --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Consuming with dm Use the derived table to cross-reference box office and awards won per film. .pull-left[ ```r box_office_vs_awards <- my_helper_dm %>% dm_zoom_to(pixar_films) %>% left_join( box_office, select = c(film, box_office_worldwide) ) %>% left_join(academy_won) %>% transmute( rating = film_rating, box_office_mln = box_office_worldwide / 1e6, n_awards_won = coalesce(n_awards_won, 0) ) %>% pull_tbl() %>% collect() ``` ] .pull-right[ ```r box_office_vs_awards ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 26 × 3</span> <span style='font-weight: bold;'>rating</span> <span style='font-weight: bold;'>box_office_mln</span> <span style='font-weight: bold;'>n_awards_won</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> G 632. 1 <span style='color: #BCBCBC;'> 2</span> G 871. 1 <span style='color: #BCBCBC;'> 3</span> PG 632. 2 <span style='color: #BCBCBC;'> 4</span> G 624. 1 <span style='color: #BCBCBC;'> 5</span> G 521. 1 <span style='color: #BCBCBC;'> 6</span> PG 735. 2 <span style='color: #BCBCBC;'> 7</span> G <span style='text-decoration: underline;'>1</span>067. 2 <span style='color: #BCBCBC;'> 8</span> PG 539. 1 <span style='color: #BCBCBC;'> 9</span> PG 858. 1 <span style='color: #BCBCBC;'>10</span> PG 807. 2 <span style='color: #949494;'># … with 16 more rows</span> </CODE></PRE> ] --- # Consuming with dm ```r ggplot(box_office_vs_awards, aes(x = box_office_mln, y = n_awards_won)) + geom_smooth() + geom_point() + facet_wrap(vars(rating)) + theme_bw(20) ``` <!-- --> --- background-image: url("data:image/png;base64,#images/24-frame.webp") background-size: 40% background-position: 100% 100% # The whole game: Exercises .pull-left[ - Experiment ] --- # Recap <table> <tr> <td rowspan=2> <img src="data:image/png;base64,#images/11-frame.webp" width="200px" /> </td> <td><img src="data:image/png;base64,#images/12-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/12_2-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/13-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/14-frame.webp" width="200px" /></td> </tr> <tr style="background:transparent"> <td><img src="data:image/png;base64,#images/21-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/22-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/23-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/24-frame.webp" width="200px" /></td> </tr> </table> --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions